Harnessing Semi-Supervised Machine Learning to Automatically Predict Bioactivities of Per- and Polyfluoroalkyl Substances (PFASs)

Many per- and polyfluoroalkyl substances (PFASs) pose significant health hazards due to their bioactive and persistent bioaccumulative properties. However, assessing the bioactivities of PFASs is both time-consuming and costly due to the sheer number and expense of in vivo and in vitro biological experiments. To this end, we harnessed new unsupervised/semi-supervised machine learning models to automatically predict bioactivities of PFASs in various human biological targets, including enzymes, genes, proteins, and cell lines. Our semi-supervised metric learning models were used to predict the bioactivity of PFASs found in the recent Organisation of Economic Co-operation and Development (OECD) report list, which contains 4730 PFASs used in a broad range of industries and consumers. Our work provides the first semi-supervised machine learning study of structure–activity relationships for predicting possible bioactivities in a variety of PFAS species.


Data Collection
Our selection of proteins focuses on targets described previously in Cheng et al., 1 and we screened the targets based on the availability of X-ray structures for our docking simulations (see Table S1).The bioactivity data from bioassays against 26 targets were extracted from the Supplementary Information of ref 1.

Discussion of Unsupervised Learning Methods
For the first unsupervised learning step, we used dimension reduction methods to project the high-dimensional fingerprint input data onto a 2-dimensional space.Two different dimension reduction methods, (i) Principal Component (PC) Analysis (PCA) followed by t-Distributed Stochastic Neighbor Embedding (t-SNE), i.e., PC t-SNE, and (ii) UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction), 2 were used on our fingerprint data.While t-SNE is a widely-used dimension reduction technique for many types of data analysis, prior studies have shown that UMAP exhibits better clustering performance, especially for large datasets. 2Thus, we decided to compare the clustering performance of both techniques.
After the dimension reduction procedure, we used the scikit-learn 3

Model Selection
The final models were selected based on the best Silhouette score, which analyzes the distances of each data point to its cluster and neighboring clusters.In short, a higher Silhouette score guarantees better performance in clustering.The Silhouette score () of a data   can be calculated by the following expression: where () is the average distance from the  ℎ point to the other points in its cluster, and (, ) is the average distance from the  ℎ point to points in another cluster .The Silhouette score was also used as an essential statistical method to optimize hyperparameters. 4 Brief Discussion of Semi-Supervised Metric Learning Deep metric learning (DML) is a widely used machine learning technique to investigate mixed data distributions before building prediction models.DML approaches have been accurately used in state-of-the-art computer vision technologies, [5][6][7] and recently, Na et al. reported their usefulness in cheminformatics. 8The central concept of DML is to make a prediction problem easy by creating a new vector representation and separating the data well depending on their target values (cf. Figure 2 in the main text).
The metric-learning algorithms compute/learn Mahalanobis distances, which are given by the expression: Based on the semi-supervised data, the metric learning problem is generally formulated as an optimization of Mahalanobis distances.That is, the metric-learning algorithm seeks to find the parameters of a distance function that optimizes some objective function measuring the agreement with the training data.In other words, we used metric-learning with a semi-supervised learning algorithm to learn a distance metric that places molecules with similar bioactivities close together and molecules with opposite bioactivities far away.

Molecular Docking Calculations
After identifying certain substructures that play a critical role in determining bioactivities, we simulated their interaction in the binding pocket of the targets to further probe the fundamental interactions giving rise to their enhanced bioactivity.To this end, AutoDock4 [9][10][11] and Autodock Vina (v1.5.6) 12 were used to carry out the molecular docking procedure and visualize the docked conformations of the ligand-protein complexes.
The structures of the proteins for the docking analysis were obtained from the RCSB PDB database (htts://www.rcsb.org/).The other three-dimensional structures were constructed from SMILES using Open Babel (v2.4.0). 13Autodock4 was used to process the ligand and ligand interaction conformation analysis in the binding affinity analysis.The docking pockets were based on the global search of the protein to determine the optimal binding sites. 10toDock4 uses a semiempirical free-energy forcefield scoring function to perform a quick and accurate evaluation of the binding energies of ligands to proteins using a two-step approach.First, the intramolecular energetics of the transition from the unbound state to the bound form of the proteinligand complex is estimated, and the intermolecular energetics of the bound complex are subsequently evaluated.

Unsupervised Machine Learning Results
In this section, we compare the performance of each machine learning algorithm in classifying/clustering the PFAS molecules based on their bioactivity (see Table S2 for their parameters and settings.)The parameters for each model are optimized for the best performance (i.e., the highest Silhouette score).As mentioned in the methods section, we tested six unsupervised learning models by combining two different dimension reduction methods (PC t-SNE and UMAP) with three different clustering methods (k-means, DBSCAN, and HDBSCAN).Table S2 shows that the combination of kmeans clustering and the UMAP model exhibited better performance (with a Silhouette score of 0.577) than the other machine-learning methods.
Figure S1 displays the distribution based on the molecular structures of C3F6 datasets.First, the molecular structures were converted into constant-length arrays by ECFP.Next, the vector representations were projected onto a 2-dimensional space using the PC t-SNE dimension reduction methods (molecules that are grouped closer together indicate structural similarity).These molecules were subsequently clustered into ten groups using k-means clustering.We then used the bioactivity data from each target to analyze the molecules with respect to promising substructures leading to bioactivity.We found that 90.6%, 76.9%, and 70.0% of molecules in cluster 1 are bioactive on CYP2C9, CYP2D6, and CYP3A4, respectively.CYP2C9, CYP2D6, and CYP3A4 are members of Cytochrome p450 enzymes (Cyps) involved in metabolism processes by oxidizing xenobiotics in the human body.
Cyps are major phase-I xenobiotic-metabolizing enzymes induced by many environmental xenobiotics and drugs. 14Also, 42.9% of molecules in cluster 4 are bioactive on ATXN, a DNA-binding protein. 15,16 only show targets with the top 4 true-positive rates of 90.6%, 76.9%, 70.0%, and 42.9%.The true positive rate is the probability that an actual positive (bioactive molecule) will be predicted positive (bioactive).Table S3 shows the maximum common substructures of the clusters.
Table S3: Cluster number, accuracy, and maximum common structure most likely to be found in bioactive molecules toward each target.Most molecules with 1,3-bis(trifluoromethyl) benzene are bioactive toward the CYP enzymes, while molecules with 1-ethyl-3,5-bis(trifluoromethyl) benzene are likely to be bioactive toward ATXN.Furthermore, the larger CF dataset (containing 62,043 molecules) did not have a higher clustering performance than the smaller C3F6 dataset (containing 1,012 molecules) since traditional unsupervised learning clustering methods, such as k-means, work better when applied to smaller datasets. 17Because the CF dataset is 50 times larger than the C3F6 dataset and screened based on less rigorous criteria, we expect it to be more challenging to handle with unsupervised learning techniques.
For example, Figure SI-2a shows the bioactivities of molecules in the CF dataset toward keratin18 (K18), an intermediate filament protein. 18Our unsupervised machine learning failed to successfully cluster the PFAS molecules based on their bioactivity toward K18, demonstrating a truepositive rate of only 55.9%.Similarly, Figure SI-2c shows the bioactivity toward CYP2C9 predicted with unsupervised learning, in which only 4.0% of the cluster showed bioactivity.These results highlight the inherent limitation of purely unsupervised learning for molecular structures.Since bioactivities are sensitive to minimal changes in molecular structures, they exhibit a mixed distribution in the molecular structure space.[21] To overcome the limitations of PC t-SNE and UMAP, we utilized semi-supervised metric learning to produce a more distinct separation between bioactive and inactive molecules toward K18, which can be seen in It is not surprising that semi-supervised metric learning demonstrated significantly higher performance for PFAS QSAR classifications since it uses partially labeled data.In contrast, an unsupervised learning model takes unlabeled data as input.Furthermore, the performance difference is more significant with larger data, considering that traditional unsupervised learning methods, including k-means clustering, work better when applied to a smaller dataset. 17Semi-supervised metric learning uses partially labeled bioactivity data as input data, placing molecules with similar bioactivities closer and opposite bioactivities further away.On the other hand, unsupervised learning tries to predict bioactivities solely based on the chemical structure, making it extremely difficult to complete an appropriate QSAR task.Table S4: Predicted substructures that induce bioactivity toward each target, based on semi-supervised metric learning.
library to execute three different clustering methods: k-means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and Hierarchical DBSCAN (HDBSCAN).While k-means clustering is relatively more time-efficient, DBSCAN and HDBSCAN can efficiently handle outliers and noisy datasets.Since each method has clear advantages, we used all three techniques and evaluated their performances.By combining dimension reduction and clustering methods, we classified/grouped various PFAS structures based on their similarities.Two different dimension reduction methods, PCA followed by t-SNE and UMAP 2 were used on the fingerprint data we generated.Three different clustering methodsk-means, DBSCAN, and HDBSCANwere used after dimension reduction.Our work utilized the PCA/PC, t-SNE, k-means, DBSCAN, and HDBSCAN algorithms in the scikit-learn 3 library.

Figure S1 :
Figure S1: Clustering of molecules in the C3F6 dataset.PFAS molecules are represented on a 2dimensional latent space using the PC t-SNE algorithm.Each point represents a molecule, and the colors of the points designate clusters classified using k-means clustering.
Figure SI-2b, which gives a true-positive rate of 79.2%.Similarly, Figure SI-2c shows the bioactivity toward CYP2C9 predicted by unsupervised learning in which, again, only 4.0% of the cluster showed bioactivity.Finally, Figure SI-2d shows the clustering based on semi-supervised metric learning, where 99.7% of a cluster's molecules are bioactive (i.e., a true positive rate of 99.7%).

Figure S2 :
Figure S2: Clustering of molecules in the CF dataset.Each point represents a molecule that is either bioactive (red) or inactive (blue) towards (a, b) K18 and (c, d) CYP2C9.The molecules are visualized on a 2-dimensional space using (a, c) PC t-SNE (unsupervised) or (b, d) semi-supervised metric learning.

Figure S3 :
Figure S3: Representative molecules found in the CheMBL database that are bioactive on CYP2C9.The CheMBL IDs of the molecules are (a) CHEMBL1411743, (b) CHEMBL1411201, and (c) CHEMBL1332759.The yellow circles represent the functional groups resulting in a structural alert of each molecule.

Figure S4 :
Figure S4: Histogram of binding energies between molecules in the CF dataset and CYP2C9.The histogram bins are organized by binding energies, where each bin represents the number of bioactive (red) and inactive (blue) molecules.

Table S2 :
Optimized hyperparameters and Silhouette scores for various unsupervised machine learning models.